Abstract

We make progress on two important problems regarding attribute efficient learnability. 

First, we give an algorithm for learning decision lists of length k over n variables using 
$2^{\tilde{O}(k^{1/3})} \log n$ examples and time $n^{\tilde{O}(k^{1/3})}$. This is the first algorithm for learning decision lists
that has both subexponential sample complexity and subexponential running time in the rel- 
evant parameters. Our approach establishes a relationship between attribute efficient learning 
and polynomial threshold functions and is based on a new construction of low degree, low weight 
polynomial threshold functions for decision lists. For a wide range of parameters our construc- 
tion matches a 1994 lower bound due to Beigel for the ODDMAXBIT predicate and gives an 
essentially optimal tradeoff between polynomial threshold function degree and weight. 

Second, we give an algorithm for learning an unknown parity function on k out of n variables 
using $O(n^{1-1/k})$ examples in time polynomial in n. For k = o(log n) this yields a polynomial
time algorithm with sample complexity o(n). This is the first polynomial time algorithm for 
learning parity on a superconstant number of variables with sublinear sample complexity. 



* Supported by an NSF Mathematical Sciences Postdoctoral Research Fellowship. 



1 Introduction 



1.1 Attribute Efficient Learning 

A central goal in machine learning is to design efficient, effective algorithms for learning from small 
amounts of data. An obstacle to achieving this goal is that learning problems are often characterized 
by an abundance of irrelevant information. In many learning problems each data point is naturally 
viewed as a high dimensional vector of attribute values; as a motivating example, in a natural 
language domain a data point representing a text document may be a vector of word frequencies 
over a lexicon of 100,000 words (attributes). A newly encountered word in a corpus may typically 
have a simple definition which uses only a dozen or so words from the entire lexicon. One would 
like to be able to learn the meaning of such a word using a number of examples which is closer to a 
dozen (the actual number of relevant attributes) than to 100,000 (the total number of attributes). 

Towards this end, an important goal in machine learning theory is to design attribute efficient 
algorithms for learning various classes of Boolean functions. A class C of Boolean functions over n 
variables $x_1, \ldots, x_n$ is said to be attribute-efficiently learnable if there is a poly(n) time algorithm
which can learn any function $f \in C$ using a number of examples which is polynomial in the "size"
(description length) of the function f to be learned, rather than in n (the number of features in the
domain over which learning takes place). (Note that the running time of the learning algorithm
must in general be at least n since each example is an n-bit vector.) Thus an attribute efficient
learning algorithm for, say, the class of Boolean conjunctions must be able to learn any Boolean
conjunction of k literals over $x_1, \ldots, x_n$ using poly(k, log n) examples, since k log n bits are required
to specify such a conjunction. 

1.2 Decision Lists 

A longstanding open problem in machine learning, posed first by Blum in 1990 [UIHIIHIEI and again 
by Valiant in 1998 [35], is to determine whether or not there exist attribute efficient algorithms
for learning decision lists. A decision list is essentially a nested "if-then-else" statement (we give a 
precise definition in Section 2).

Attribute efficient learning of decision lists is of both theoretical and practical interest. Blum's 
motivation for considering the problem came from the infinite attribute model; in this model
there are infinitely many attributes but the concept to be learned depends on only a small number 
of them, and each example consists of a finite list of active attributes. Blum et al. [H] showed 
that for a wide range of concept classes (including decision lists) attribute efficient learnability in 
the standard n-attribute model is equivalent to learnability in the infinite attribute model. Since 
simple classes such as disjunctions and conjunctions are attribute efficiently learnable (and hence 
learnable in the infinite attribute model), this motivated Blum [3] to ask whether the richer class 
of decision lists is thus learnable as well.^1 Several researchers have subsequently considered this
problem; see e.g. [6, 10, 12, 29, 32]. We summarize some of this previous work in Section 1.6.

From an applied perspective, Valiant [35] relates the problem of learning decision lists attribute
efficiently to the question "how can human beings learn from small amounts of data in the presence 
of irrelevant information?" He points out that since decision lists play an important role in various 
models of cognition, a first step in understanding this phenomenon would be to identify efficient 
algorithms which learn decision lists from few examples. Due to the lack of progress in developing 
such algorithms for decision lists, Valiant suggests that models of cognition should perhaps focus
on "flatter" classes of functions such as projective DNF [35].

^1 Additional motivation comes from the fact that decision lists have such a simple algorithm in the PAC model.

1.3 Parity Functions

Another outstanding challenge in machine learning is to determine whether there exist attribute 
efficient algorithms for learning parity functions. The parity function on a set of 0/1-valued variables 
$x_{i_1}, \ldots, x_{i_k}$ is equal to $x_{i_1} + \cdots + x_{i_k}$ modulo 2. As with the class of decision lists, a simple PAC
learning algorithm is known for the class of parity functions but no attribute efficient PAC learning
algorithm is known. Learning parity functions plays an important role in Fourier learning methods
[27] and is closely related to decoding random linear codes [S]. Both A. Blum [H] and Y. Mansour
[25] cite attribute efficient learning of parity functions as an important open problem.

1.4 Our Results: Decision Lists 

We give the first learning algorithm for decision lists that is subexponential in both sample com- 
plexity (in the relevant parameters k and logn) and running time (in the relevant parameter k). 
Our results demonstrate for the first time that it is possible to simultaneously avoid the "worst 
case" in both sample complexity and running time, and thus suggest that it may indeed be possible 
to learn decision lists attribute efficiently. 

Our main learning result for decision lists is: 

Theorem 1 There is an algorithm for learning decision lists over $\{0,1\}^n$ which, when learning a
decision list of length k, has mistake bound $2^{\tilde{O}(k^{1/3})} \log n$ and runs in time $n^{\tilde{O}(k^{1/3})}$.^2

We prove Theorem 1 in two parts; first we generalize Littlestone's well known Winnow algorithm
[22] for learning linear threshold functions to learn polynomial threshold functions. In previous
learning results, polynomial threshold functions are learned by applying techniques from linear
programming: a Boolean function computed by a polynomial threshold function of degree d can be
learned in time $n^{O(d)}$ by using polynomial time linear programming algorithms such as the Ellipsoid
algorithm (see e.g. [20]). In contrast, we use the Winnow algorithm to learn polynomial threshold
functions. Winnow learns using few examples in a small amount of time provided that the degree 
of the polynomial is low and the integer coefficients of the polynomial are not too large: 

Theorem 2 Let C be a class of Boolean functions over $\{0,1\}^n$ with the property that each $f \in C$ has
a polynomial threshold function of degree at most d and weight at most W. Then there is an online
learning algorithm for C which runs in $n^d$ time per example and has mistake bound $O(W^2 \cdot d \cdot \log n)$.

At this point we have reduced the problem of learning decision lists attribute efficiently to the 
problem of representing decision lists with polynomial threshold functions of low weight and low 
degree. To this end we prove 

Theorem 3 Let L be a decision list of length k. Then L is computed by a polynomial threshold
function of degree $\tilde{O}(k^{1/3})$ and weight $2^{\tilde{O}(k^{1/3})}$.

Theorem 1 follows directly from Theorems 2 and 3.

Polynomial threshold function constructions have recently been used to obtain the fastest known
algorithms for a range of important learning problems such as learning DNF formulas [20],
intersections of halfspaces [19], and Boolean formulas of superconstant depth [30]. For each of these
learning problems the sole goal was to obtain fast learning algorithms, and hence the only parameter
of interest in these polynomial threshold function constructions is their degree, since degree bounds
translate directly into running time bounds for learning algorithms (see e.g. [20]). In contrast, for
the decision list problem we are interested in both the running time and the number of examples
required for learning. Thus we must bound both the degree and the weight (magnitude of integer
coefficients) of the polynomial threshold functions which we use.

^2 Throughout this section we use "sample complexity" and "mistake bound" interchangeably; as described in
Section 2 these notions are essentially identical.

Our polynomial threshold function construction is essentially optimal in the tradeoff between 
degree and weight which it achieves. In 1994 Beigel gave a lower bound showing that any degree d 
polynomial threshold function for a particular decision list must have weight $2^{\Omega(n/d^2)}$. For $d = n^{1/3}$,
Beigel's lower bound implies that the construction stated in Theorem 3 is essentially optimal.
Furthermore, for any decision list L of length n and any $d < n^{1/3}$, we will in fact construct
polynomial threshold functions of degree d and weight $2^{\tilde{O}(n/d^2)}$ computing L. Beigel's lower bound
thus implies that our degree d polynomial threshold functions are of roughly optimal weight for all
$d < n^{1/3}$, and hence strongly suggests that our analysis is the best possible for the algorithm we
use. 

1.5 Our Results: Parity Functions 

For parity functions, we give an $O(n^4)$ time algorithm which can learn an unknown parity on k
variables out of n using $O(n^{1-1/k})$ examples. For values of k = o(log n) the sample complexity of
this algorithm is o(n). This is the first algorithm for learning parity on a superconstant number of 
variables with sublinear sample complexity. 

The standard PAC learning algorithm for learning an unknown parity function is based on 
viewing a set of m labelled examples as a system of m linear equations modulo 2. Using Gaussian 
elimination it is possible to solve the system and find a consistent parity function. It can be shown 
that the solution thus obtained is a "good" hypothesis if its weight (number of nonzero entries) is 
small relative to m, the number of examples. However, using Gaussian elimination can result in a 
solution of weight as large as min(m, n) even if k (the number of variables in the target parity) is 
very small. Thus in order for this approach to give a successful learning algorithm, it is necessary to 
use $m = \Omega(n)$ examples regardless of the value of k. In contrast, observe that an attribute efficient
algorithm for learning a parity of length k should use only poly(k, log n) examples.
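
The standard approach described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `solve_parity_gf2` is hypothetical, labels are assumed to be encoded as 0/1 (the sum modulo 2) rather than as +1/-1, and free variables are simply set to 0, so nothing in the sketch controls the weight of the returned solution.

```python
from typing import List, Optional

def solve_parity_gf2(examples: List[List[int]], labels: List[int]) -> Optional[List[int]]:
    """Find a 0/1 vector s with <a, s> = b (mod 2) for every example (a, b).

    Returns None if no parity is consistent with the examples.
    Free variables are set to 0; the result need not be a sparse solution.
    """
    n = len(examples[0]) if examples else 0
    rows = [a[:] + [b] for a, b in zip(examples, labels)]  # augmented matrix
    pivot_cols = []
    r = 0  # index of the next pivot row
    for col in range(n):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] == 1), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        # Clear this column in every other row (addition mod 2 is XOR).
        for i in range(len(rows)):
            if i != r and rows[i][col] == 1:
                rows[i] = [u ^ v for u, v in zip(rows[i], rows[r])]
        pivot_cols.append(col)
        r += 1
    # A row of the form 0 ... 0 | 1 means the system is inconsistent.
    if any(row[-1] == 1 and not any(row[:-1]) for row in rows):
        return None
    solution = [0] * n
    for row_index, col in enumerate(pivot_cols):
        solution[col] = rows[row_index][-1]
    return solution
```

With n linearly independent examples the solution is unique and equals the target; with fewer examples the returned vector is merely consistent, which is exactly why the weight of the solution matters for the sample complexity.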

Our algorithm works by finding a "low weight" solution to a system of m linear equations. We 
prove that with high probability we can find a solution of weight $O(n^{1-1/k})$ irrespective of m. Thus
by taking m to be only slightly larger than $n^{1-1/k}$ we have that our solution is a "good" hypothesis.

1.6 Previous Results: Decision Lists 

In previous work several algorithms with different performance bounds (in terms of running time 
and number of examples used) have been given for learning decision lists. 

• Rivest [31] gave the first algorithm for learning decision lists in Valiant's PAC model of
learning from random examples. Littlestone fSl subsequently gave an analogue of Rivest's 
algorithm in the online learning model. The algorithm can learn any decision list of length k 
in $O(kn^2)$ time using $O(kn)$ examples.

• A brute-force approach to learning decision lists of length k is to maintain a collection of all 
such lists which are consistent with the examples seen so far, and to predict at each stage 
using majority vote over the surviving hypotheses. This "halving algorithm" (proposed in
various forms by Barzdin and Freivald, Mitchell, and Angluin) can learn decision
lists of length k using only $O(k \log n)$ examples, but the running time is $n^{O(k)}$.

• Several researchers OISH] have observed that Littlestone's well-known Winnow algorithm [22]
can learn decision lists of length k from $2^{O(k)} \log n$ examples in time $2^{O(k)} n \log n$. This follows
from the observation that decision lists of length k can be viewed as linear threshold functions
with integer coefficients of magnitude $2^{\Theta(k)}$. We note that our algorithm in this paper always
has improved sample complexity over the basic Winnow algorithm, and for $k > (\log n)^{3/2}$ our
approach improves on the time complexity of Winnow as well.

• Finally, several researchers have considered the special case of learning a decision list of length 
k over n variables in which the output bits of the decision list have at most D alternations. 
Valiant [35] and Nevo and El-Yaniv [29] have given refined analyses of Winnow's performance
for this special case, and Dhagat and Hellerstein have also studied this problem. However, 
for the general case in which D can be as large as k, the results thus obtained do not improve 
on the straightforward Winnow analysis described in the previous bullet. 
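
The brute-force halving approach from the list above can be sketched generically. This is an illustration under assumptions: `HalvingLearner` and `dictators` are hypothetical names, hypotheses are plain Python callables returning +1 or -1, and the toy class of single-variable "dictator" functions stands in for the (much larger) class of length-k decision lists.

```python
from typing import Callable, Iterable, List, Sequence

Hypothesis = Callable[[Sequence[int]], int]  # maps x in {0,1}^n to +1 or -1

class HalvingLearner:
    """Predict by majority vote over all hypotheses consistent so far."""

    def __init__(self, hypotheses: Iterable[Hypothesis]):
        self.version_space: List[Hypothesis] = list(hypotheses)
        self.mistakes = 0

    def predict(self, x: Sequence[int]) -> int:
        votes = sum(h(x) for h in self.version_space)
        return 1 if votes >= 0 else -1

    def update(self, x: Sequence[int], label: int) -> None:
        if self.predict(x) != label:
            self.mistakes += 1  # a wrong majority means at least half get removed
        self.version_space = [h for h in self.version_space if h(x) == label]

def dictators(n: int) -> List[Hypothesis]:
    """Toy hypothesis class: h_i(x) = +1 if x_i = 1, else -1."""
    return [lambda x, i=i: 1 if x[i] == 1 else -1 for i in range(n)]
```

Each mistake removes at least half of the surviving hypotheses, so the number of mistakes is at most the base-2 logarithm of the class size; the cost is that every prediction touches every surviving hypothesis.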

These previous algorithmic results are summarized in Table 1. We observe that all of these earlier
algorithms have an exponential dependence on the relevant parameter(s) (k and log n for sample
complexity, k for running time) for either the running time or the sample complexity.



Reference               Number of examples                   Running time
Rivest / Littlestone    $O(kn)$                              $O(kn^2)$
Halving algorithm       $O(k \log n)$                        $n^{O(k)}$
Winnow algorithm        $2^{O(k)} \log n$                    $2^{O(k)} n \log n$
This Paper              $2^{\tilde{O}(k^{1/3})} \log n$      $n^{\tilde{O}(k^{1/3})}$

Table 1: Comparison of known algorithms for learning decision lists of length k on n variables.

1.7 Previous Results: Parity Functions 

Little previous work has been published on learning parity functions attribute efficiently in the 
PAC model. The standard PAC learning algorithm for parity (based on solving a system of linear 
equations) is due to Helmbold et al. [17]; however, as described above this algorithm is not attribute
efficient since it uses $\Omega(n)$ examples.

Several authors have considered learning parity attribute efficiently in a model where the learner 
is allowed to make membership queries. Attribute efficient learning is easier in this framework since 
membership queries can help identify relevant variables. Blum et al. [Sj give a randomized
polynomial time membership-query algorithm for learning parity on k variables using only $O(k \log n)$
examples. These results were later refined by Uehara et al. [34].

1.8 Organization 

In Section 2 we give the necessary background on online learning and polynomial threshold
functions. In Section 3 we show how known results from learning theory enable us to reduce the decision
list learning problem to a problem of finding suitable polynomial threshold function representations
of decision lists. In Sections 4.1 and 4.2 we give two different proofs of a weak tradeoff between
degree and weight for polynomial threshold function representations of decision lists, and in
Section 4.3 we combine these techniques to prove Theorem 3. In Section 5 we show how to apply our
techniques to give a tradeoff between sample complexity and running time for learning decision
trees. In Section 6 we discuss the connection with Beigel's ODDMAXBIT lower bound and related
issues. In Section 7 we give our new algorithm for learning parity functions, and in Section 8 we
suggest directions for future work.

2 Preliminaries 

Attribute efficient learning has been chiefly studied in the on-line mistake-bound model of concept 
learning which was introduced in [22, 24]. In this model learning proceeds in a series of trials,
where in each trial the learner is given an unlabelled Boolean example $x \in \{0,1\}^n$ and must predict
the value f(x) of the unknown target function f. After each prediction the learner is given the true
value of f(x) and can update its hypothesis before the next trial begins. The mistake bound of a
learning algorithm on a target concept c is measured by the worst-case number of mistakes that
the algorithm makes over all (possibly infinite) sequences of examples, and the mistake bound of
a learning algorithm on a concept class (class of Boolean functions) C is the worst-case mistake
bound across all functions $f \in C$. The running time of a learning algorithm A for a concept class
C is defined as the product of the mistake bound of A on C times the maximum running time
required by A to evaluate its hypothesis and update its hypothesis in any trial. 

Our main interests in this paper are the classes of decision lists and parity functions. 

A decision list L of length k over the Boolean variables $x_1, \ldots, x_n$ is represented by a list of k
pairs and a bit

$(\ell_1, b_1), (\ell_2, b_2), \ldots, (\ell_k, b_k), b_{k+1}$

where each $\ell_i$ is a literal and each $b_i$ is either -1 or 1. Given any $x \in \{0,1\}^n$, the value of L(x) is
$b_i$ if i is the smallest index such that $\ell_i$ is made true by x; if no $\ell_i$ is true then $L(x) = b_{k+1}$.
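
The definition above translates directly into a short evaluation routine. The encoding is an assumption made for illustration: a literal is a pair (0-based variable index, negated flag), and the names `eval_decision_list` and `example_list` are hypothetical.

```python
from typing import List, Sequence, Tuple

# A literal is (index, negated): index is a 0-based variable position, and
# negated=True means the literal is satisfied when that variable equals 0.
Literal = Tuple[int, bool]

def eval_decision_list(pairs: List[Tuple[Literal, int]], default: int,
                       x: Sequence[int]) -> int:
    """Return L(x): the output bit of the first pair whose literal x satisfies."""
    for (index, negated), bit in pairs:
        satisfied = (x[index] == 0) if negated else (x[index] == 1)
        if satisfied:
            return bit
    return default  # no literal was satisfied, so output b_{k+1}

# "if x_1 then -1, else if (not x_3) then +1, else -1" (1-based names in prose).
example_list = [((0, False), -1), ((2, True), 1)]
```

For instance, `eval_decision_list(example_list, -1, [0, 1, 1])` falls through both pairs and returns the default bit.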

A parity function of length k is defined by a set of variables $S \subseteq \{x_1, \ldots, x_n\}$ such that |S| = k.
The parity function $\chi_S(x)$ takes value 1 on inputs which set an even number of variables in S to 1
and takes value -1 on inputs which set an odd number of variables in S to 1.

Given a concept class C over $\{0,1\}^n$ and a Boolean function $f \in C$, let size(f) denote the
description length of f under some reasonable encoding scheme. (Note that if f has r relevant
variables then size(f) will be at least r log n since this many bits are required just to specify which
variables are relevant.) We say that a learning algorithm A for C in the mistake-bound model
is attribute-efficient if the mistake bound of A on any concept $f \in C$ is polynomial in size(f). In
particular, the description length of a length k decision list (parity) is $O(k \log n)$, and thus we would
ideally like to have an algorithm which learns decision lists (parities) of length k with a mistake
bound of poly(k, log n) and runs in time poly(n).

(We note here that attribute efficiency has also been studied in other learning models, namely 
Valiant's Probably Approximately Correct (PAC) model of learning from random examples. Stan- 
dard conversion techniques are known ^ I16[ I23j which can be used to transform any mistake bound 
algorithm into a PAC learning algorithm. This transformation essentially preserves the running 
time of the mistake bound algorithm, and the sample size required by the PAC algorithm is essen- 
tially the mistake bound. Thus, positive results for mistake bound learning, such as those we give 
for decision lists in this paper, directly yield corresponding positive results for the PAC model.) 

Finally, our results for decision lists are achieved by a careful analysis of polynomial threshold 
functions. Let f be a Boolean function $f : \{0,1\}^n \to \{-1,1\}$ and let p be a polynomial in n
variables with integer coefficients. Let d denote the degree of p and let W denote the sum of the
absolute values of p's integer coefficients. If the sign of p(x) equals f(x) for every $x \in \{0,1\}^n$, then
we say that p is a polynomial threshold function of degree d and weight W for f.
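
To make the two parameters concrete, here is a small sketch that computes degree and weight and checks the sign condition by brute force. The dictionary representation and all function names are assumptions for illustration; the sketch also treats p(x) = 0 as a failure, which the definition above does not explicitly address.

```python
from itertools import product
from typing import Callable, Dict, FrozenSet, Sequence

# Hypothetical representation: a multilinear integer polynomial maps each
# monomial (a frozenset of 0-based variable indices) to its coefficient.
Polynomial = Dict[FrozenSet[int], int]

def evaluate(p: Polynomial, x: Sequence[int]) -> int:
    """Evaluate p at a point x in {0,1}^n."""
    return sum(c for mono, c in p.items() if all(x[i] == 1 for i in mono))

def degree(p: Polynomial) -> int:
    """Largest number of variables in a monomial with nonzero coefficient."""
    return max((len(mono) for mono, c in p.items() if c != 0), default=0)

def weight(p: Polynomial) -> int:
    """Sum of the absolute values of the coefficients."""
    return sum(abs(c) for c in p.values())

def is_ptf_for(p: Polynomial, f: Callable[[Sequence[int]], int], n: int) -> bool:
    """Check sign(p(x)) == f(x) on all of {0,1}^n (exponential in n)."""
    for x in product([0, 1], repeat=n):
        value = evaluate(p, x)
        if value == 0 or (1 if value > 0 else -1) != f(x):
            return False
    return True

# The decision list (x_1, -1), (x_2, +1), -1 and the candidate -4*x_1 + 2*x_2 - 1.
example_p: Polynomial = {frozenset([0]): -4, frozenset([1]): 2, frozenset(): -1}

def example_f(x: Sequence[int]) -> int:
    if x[0] == 1:
        return -1
    if x[1] == 1:
        return 1
    return -1
```

Here `example_p` is a degree 1, weight 7 polynomial threshold function for `example_f`.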

3 Expanded-Winnow: Learning Polynomial Threshold Functions



Littlestone introduced the online Winnow algorithm in 1988 and showed that it can attribute 
efficiently learn Boolean conjunctions, disjunctions, and low weight linear threshold functions. 
Throughout its execution Winnow maintains a linear threshold function as its hypothesis; at the 
heart of the algorithm is a novel update rule which makes a multiplicative update to each coefficient 
of the hypothesis (rather than an additive update as in the Perceptron algorithm) each time a mis- 
take is made. Since its introduction Winnow has been intensively studied from both applied and 
theoretical standpoints (see e.g. [71 El El and multiplicative updates have become widespread 
in machine learning algorithms. 

The following theorem (which, as noted in [35], is implicit in Littlestone's analysis in [22]) gives
a mistake bound for Winnow when learning linear threshold functions:

Theorem 4 Let f(x) be the linear threshold function $\mathrm{sign}(\sum_{i=1}^{n} w_i x_i - \theta)$ where $\theta$ and $w_1, \ldots, w_n$
are integers. Let $W = \sum_{i=1}^{n} |w_i|$. Then Winnow learns f(x) with mistake bound $O(W^2 \log n)$, and
uses n time steps per example.

We will use a generalization of the Winnow algorithm, called Expanded-Winnow, to learn
polynomial threshold functions of degree at most d. Our generalization introduces $\sum_{i=1}^{d} \binom{n}{i}$ new
variables (one for each monomial of degree up to d) and runs Winnow to learn a linear threshold
function over these new variables. More precisely, in each trial we convert the n-bit received example
$x = (x_1, \ldots, x_n)$ into a $\sum_{i=1}^{d} \binom{n}{i}$ bit expanded example (where the bits in the expanded example
correspond to monomials over $x_1, \ldots, x_n$), and we give the expanded example to Winnow. Thus
the hypothesis which Winnow maintains - a linear threshold function over the space of expanded
features - is a polynomial threshold function of degree d over the original n variables $x_1, \ldots, x_n$.
Theorem 2, which follows directly from Theorem 4, summarizes the performance of
Expanded-Winnow:

Theorem 2 Let C be a class of Boolean functions over $\{0,1\}^n$ with the property that each $f \in C$ has
a polynomial threshold function of degree at most d and weight at most W. Then the Expanded-Winnow
algorithm runs in $n^d$ time per example and has mistake bound $O(W^2 \cdot d \cdot \log n)$ for C.

Theorem 2 shows that the degree of a polynomial threshold function corresponds to
Expanded-Winnow's running time, and the weight of a polynomial threshold function corresponds to its
sample complexity. 
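
The feature expansion can be sketched as below. This is a simplified illustration, not the exact algorithm analyzed in Theorem 4: it uses the textbook doubling/halving form of Winnow with threshold equal to the number of features, which is only guaranteed to work for targets that are monotone disjunctions over the expanded features. Handling general integer weights would need complemented features and tuned parameters, which are omitted. The names `expand`, `ExpandedWinnow`, and `train_on_cube` are hypothetical.

```python
from itertools import combinations, product
from typing import Callable, List, Sequence

def expand(x: Sequence[int], d: int) -> List[int]:
    """Map x in {0,1}^n to the values of all monomials of degree 1..d."""
    n = len(x)
    return [int(all(x[i] for i in mono))
            for size in range(1, d + 1)
            for mono in combinations(range(n), size)]

class ExpandedWinnow:
    """Winnow (doubling/halving variant) run over monomial features."""

    def __init__(self, n: int, d: int):
        self.d = d
        self.num_features = len(expand([0] * n, d))
        self.weights = [1.0] * self.num_features
        self.threshold = float(self.num_features)
        self.mistakes = 0

    def predict(self, x: Sequence[int]) -> int:
        z = expand(x, self.d)
        total = sum(w for w, bit in zip(self.weights, z) if bit)
        return 1 if total >= self.threshold else -1

    def update(self, x: Sequence[int], label: int) -> None:
        if self.predict(x) == label:
            return
        self.mistakes += 1
        # Multiplicative update on the active features only.
        factor = 2.0 if label == 1 else 0.5
        z = expand(x, self.d)
        self.weights = [w * factor if bit else w for w, bit in zip(self.weights, z)]

def train_on_cube(learner: ExpandedWinnow, target: Callable[[Sequence[int]], int],
                  n: int, passes: int) -> None:
    """Feed every point of {0,1}^n to the learner, repeated `passes` times."""
    for _ in range(passes):
        for x in product([0, 1], repeat=n):
            learner.update(x, target(x))
```

For example, the conjunction of two variables is a single degree 2 monomial, so with d = 2 it becomes a one-feature disjunction in the expanded space and this variant learns it with few mistakes. The per-example cost is dominated by enumerating the monomials, which is where the $n^d$ running time comes from.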

4 Constructing Polynomial Threshold Functions for Decision Lists 

In previous constructions of polynomial threshold functions for computational learning theory
applications [20, 19, 30] the sole goal has been to minimize the degree of the polynomials regardless
of the size of the coefficients. As an extreme example, the construction of [20] of $\tilde{O}(n^{1/3})$ degree
polynomial threshold functions for DNF formulae yields polynomials whose coefficients can be
doubly exponential in the degree. In contrast, given Theorem 2 we must now construct polynomial
threshold functions that have low degree and low weight.

We give two constructions of polynomial threshold functions for decision lists, each of which 
has relatively low degree and relatively low weight. We then combine these approaches to achieve 
an optimal construction with improved bounds on both degree and weight. 

4.1 Outer Construction



Let L be a decision list of length k over variables $x_1, \ldots, x_n$. We first give a simple construction
of a degree h, weight $2^{O(k/h+h)}$ polynomial threshold function for L which is based on breaking
the list L into sublists. We call this construction the "outer construction" since we will ultimately 
combine this construction with a different construction for the "inner" sublists. 

We begin by showing that L can be expressed as a threshold of modified decision lists which
we now define. The set $B_h$ of modified decision lists is defined as follows: each function in $B_h$ is
a decision list $(\ell_1, b_1), (\ell_2, b_2), \ldots, (\ell_h, b_h), 0$ where each $\ell_i$ is some literal over $x_1, \ldots, x_n$ and each
$b_i \in \{-1,1\}$. Thus the only difference between a modified decision list $f \in B_h$ and a normal decision
list of length h is that the final output value is 0 rather than $b_{h+1} \in \{-1,+1\}$.

Without loss of generality we may suppose that the list L is $(x_1, b_1), \ldots, (x_k, b_k), b_{k+1}$. We break
L sequentially into k/h blocks each of length h. Let $f_i \in B_h$ be the modified decision list which
corresponds to the i-th block of L, i.e. $f_i$ is the list $(x_{(i-1)h+1}, b_{(i-1)h+1}), \ldots, (x_{ih}, b_{ih}), 0$.

Intuitively $f_i$ computes the ith block of L and equals 0 only if we "fall off the edge" of the ith block.
We then have the following straightforward claim:

Claim 5 The decision list L is equivalent to

$\mathrm{sign}\left( \sum_{i=1}^{k/h} 2^{k/h-i+1} f_i(x) + b_{k+1} \right)$.    (1)

Proof: Given an input $x \ne 0^k$ let $r = (i-1)h + c$ be the first index such that $x_r$ is satisfied. It is easy
to see that $f_j(x) = 0$ for $j < i$ and hence the value in (1) is $2^{k/h-i+1} b_r + \sum_{j=i+1}^{k/h} 2^{k/h-j+1} f_j(x) + b_{k+1}$,
the sign of which is easily seen to be $b_r$. Finally if $x = 0^k$ then the argument to sign is $b_{k+1}$. □

Note: It is easily seen that we can replace the 2 in formula (1) by a 3; this will prove useful later.
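
Claim 5, and the note about replacing 2 by 3, can be checked exhaustively for small parameters. The sketch below assumes that h divides k and that all literals are the unnegated variables $x_1, \ldots, x_k$, as in the proof; the function names are hypothetical.

```python
from itertools import product
from typing import Sequence

def eval_list(bits: Sequence[int], default: int, x: Sequence[int]) -> int:
    """Decision list (x_1, b_1), ..., (x_m, b_m), default over unnegated literals."""
    for xi, b in zip(x, bits):
        if xi == 1:
            return b
    return default

def outer_threshold(bits: Sequence[int], default: int, x: Sequence[int],
                    h: int, base: int = 2) -> int:
    """Sign of sum_i base^(k/h - i + 1) * f_i(x) + b_{k+1}, as in formula (1)."""
    k = len(bits)
    blocks = k // h  # assumes h divides k
    total = default
    for i in range(1, blocks + 1):
        lo, hi = (i - 1) * h, i * h
        # Modified decision list for block i: outputs 0 if no literal fires.
        f_i = eval_list(bits[lo:hi], 0, x[lo:hi])
        total += base ** (blocks - i + 1) * f_i
    return 1 if total > 0 else -1

def claim_holds(k: int, h: int, base: int = 2) -> bool:
    """Compare both sides on every list of length k and every input in {0,1}^k."""
    for bits in product([-1, 1], repeat=k):
        for default in (-1, 1):
            for x in product([0, 1], repeat=k):
                if outer_threshold(bits, default, x, h, base) != eval_list(bits, default, x):
                    return False
    return True
```

Calling `claim_holds(6, 2)` checks the claim as stated, and `claim_holds(6, 3, base=3)` checks the variant with 3 in place of 2.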

As an aside, note that Claim 5 can already be used to obtain a tradeoff between running time
and sample complexity for learning decision lists. The class $B_h$ contains at most $(4n)^h$ functions.
Thus as in Section 3 it is possible to run the Winnow algorithm using the functions in $B_h$ as the
base features for Winnow. (So for each example x which it receives, the algorithm would first
compute the value of f(x) for each $f \in B_h$, and would then use this vector of $(f(x))_{f \in B_h}$ values
as the example point for Winnow.) A direct analogue of Theorem 2 now implies that Expanded-Winnow
(run over this expanded feature space of functions from $B_h$) can be used to learn in
time $n^{O(h)} 2^{O(k/h)}$ with mistake bound $2^{O(k/h)} h \log n$.

However, it will be more useful for us to obtain a polynomial threshold function for L. We can
do this from Claim 5 as follows:

Theorem 6 Let L be a decision list of length k. Then for any h < k we have that L is computed
by a polynomial threshold function of degree h and weight $4 \cdot 2^{k/h+h}$.

Proof: Consider the first modified decision list $f_1 = (\ell_1, b_1), (\ell_2, b_2), \ldots, (\ell_h, b_h), 0$ in the expression
(1). For $\ell$ a literal let $\tilde{\ell}$ denote x if $\ell$ is an unnegated variable x and let $\tilde{\ell}$ denote 1 - x if $\ell$ is a
negated variable $\bar{x}$. We have that for all $x \in \{0,1\}^n$, $f_1(x)$ is computed exactly by the polynomial

$f_1(x) = \tilde{\ell}_1 b_1 + (1 - \tilde{\ell}_1)\tilde{\ell}_2 b_2 + (1 - \tilde{\ell}_1)(1 - \tilde{\ell}_2)\tilde{\ell}_3 b_3 + \cdots + (1 - \tilde{\ell}_1) \cdots (1 - \tilde{\ell}_{h-1})\tilde{\ell}_h b_h$.

This polynomial has degree h and has weight at most $2^{h+1}$. Summing these polynomial
representations for $f_1, \ldots, f_{k/h}$ in (1) we see that the resulting polynomial threshold function given by
(1) has degree h and weight at most $2^{k/h+1} \cdot 2^{h+1} = 4 \cdot 2^{k/h+h}$. □

Specializing to the case $h = \sqrt{k}$ we obtain:

Corollary 7 Let L be a decision list of length k. Then L is computed by a polynomial threshold
function of degree $k^{1/2}$ and weight $4 \cdot 2^{2k^{1/2}}$.

We close this section by observing that an intermediate result of [23] can be used to give an
alternate proof of Corollary 7 with slightly weaker parameters; see the appendix.
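
The exact polynomial for a modified decision list used in the proof of Theorem 6 can be expanded mechanically, which makes the degree and weight bounds easy to inspect on small cases. The sketch below is an illustration under assumptions: polynomials are dictionaries from monomials to integer coefficients, products are reduced using $x_i^2 = x_i$ (valid on 0/1 inputs), and all names are hypothetical.

```python
from typing import Dict, FrozenSet, List, Sequence, Tuple

Polynomial = Dict[FrozenSet[int], int]   # monomial (set of indices) -> coefficient
Literal = Tuple[int, bool]               # (0-based index, negated flag)
ONE: Polynomial = {frozenset(): 1}

def poly_add(p: Polynomial, q: Polynomial) -> Polynomial:
    out = dict(p)
    for mono, c in q.items():
        out[mono] = out.get(mono, 0) + c
    return {m: c for m, c in out.items() if c != 0}

def poly_mul(p: Polynomial, q: Polynomial) -> Polynomial:
    out: Polynomial = {}
    for m1, c1 in p.items():
        for m2, c2 in q.items():
            mono = m1 | m2               # x_i * x_i = x_i on {0,1} inputs
            out[mono] = out.get(mono, 0) + c1 * c2
    return {m: c for m, c in out.items() if c != 0}

def scale(p: Polynomial, s: int) -> Polynomial:
    return {m: s * c for m, c in p.items()}

def literal_poly(index: int, negated: bool) -> Polynomial:
    """The polynomial x for an unnegated literal, 1 - x for a negated one."""
    return {frozenset(): 1, frozenset([index]): -1} if negated else {frozenset([index]): 1}

def modified_list_poly(pairs: List[Tuple[Literal, int]]) -> Polynomial:
    """Sum over j of b_j * lit_j * prod_{l<j} (1 - lit_l)."""
    result: Polynomial = {}
    prefix = ONE                          # product of (1 - lit_l) seen so far
    for (index, negated), bit in pairs:
        lit = literal_poly(index, negated)
        result = poly_add(result, scale(poly_mul(prefix, lit), bit))
        prefix = poly_mul(prefix, poly_add(ONE, scale(lit, -1)))
    return result

def evaluate(p: Polynomial, x: Sequence[int]) -> int:
    return sum(c for mono, c in p.items() if all(x[i] == 1 for i in mono))

def degree(p: Polynomial) -> int:
    return max((len(m) for m in p), default=0)

def weight(p: Polynomial) -> int:
    return sum(abs(c) for c in p.values())

def eval_modified(pairs: List[Tuple[Literal, int]], x: Sequence[int]) -> int:
    """Direct evaluation of the modified list; 0 if no literal is satisfied."""
    for (index, negated), bit in pairs:
        if (x[index] == 0) == negated:
            return bit
    return 0
```

On a list of length h the expanded polynomial agrees with `eval_modified` on every 0/1 input, and its weight can be compared against the $2^{h+1}$ bound from the proof.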

4.2 Inner Approximator 

In this section we construct low degree, low weight polynomials which approximate (in the $L_\infty$
norm) the modified decision lists from the previous subsection. Moreover, the polynomials we 
construct are exactly correct on inputs which "fall off the end": 

Theorem 8 Let $f \in B_h$ be a modified decision list of length h (without loss of generality we may
assume that f is $(x_1, b_1), \ldots, (x_h, b_h), 0$). Then there is a degree $2\sqrt{h} \log h$ polynomial p such that

• for every input $x \in \{0,1\}^h$ we have $|p(x) - f(x)| \le 1/h$.

• $p(0^h) = f(0^h) = 0$.

Proof: As in the proof of Theorem 6 we have that

$f(x) = b_1 x_1 + b_2 (1 - x_1) x_2 + \cdots + b_h (1 - x_1) \cdots (1 - x_{h-1}) x_h$.

We will construct a lower (roughly $\sqrt{h}$) degree polynomial which closely approximates f. Let $T_i$
denote $(1 - x_1) \cdots (1 - x_{i-1}) x_i$, so we can rewrite f as

$f(x) = b_1 T_1 + b_2 T_2 + \cdots + b_h T_h$.

We approximate each $T_i$ separately as follows: set $A_i(x) = h - i + x_i + \sum_{j=1}^{i-1} (1 - x_j)$. Note that
for $x \in \{0,1\}^h$, we have $T_i(x) = 1$ iff $A_i(x) = h$ and $T_i(x) = 0$ iff $0 \le A_i(x) \le h - 1$. Now define
the polynomial

$Q_i(x) = q(A_i(x)/h)$ where $q(y) = C_d(y (1 + 1/h))$.

As in [20], here $C_d(x)$ is the dth Chebyshev polynomial of the first kind (a univariate polynomial of
degree d) with d set to $\lceil \sqrt{h} \rceil$. We will need the following facts about Chebyshev polynomials [11]:

• $|C_d(x)| \le 1$ for $|x| \le 1$ with $C_d(1) = 1$;

• $C_d'(x) \ge d^2$ for $x > 1$ with $C_d'(1) = d^2$;

• The coefficients of $C_d$ are integers each of whose magnitude is at most $2^d$.

These first two facts imply that $q(1) \ge 2$ but $|q(y)| \le 1$ for $y \in [0, 1 - \frac{1}{h}]$. We thus have that
$Q_i(x) = q(1) \ge 2$ if $T_i(x) = 1$ and $|Q_i(x)| \le 1$ if $T_i(x) = 0$. Now define $P_i(x) = \left( \frac{Q_i(x)}{q(1)} \right)^{2 \log h}$. This
polynomial is easily seen to be a good approximator for $T_i$: if $x \in \{0,1\}^h$ is such that $T_i(x) = 1$
then $P_i(x) = 1$, and if $x \in \{0,1\}^h$ is such that $T_i(x) = 0$ then $|P_i(x)| \le (\frac{1}{2})^{2 \log h} \le \frac{1}{h^2}$.

Now define $R(x) = \sum_{i=1}^{h} b_i P_i(x)$ and $p(x) = R(x) - R(0^h)$. It is clear that $p(0^h) = 0$. We will
show that for every input $0^h \ne x \in \{0,1\}^h$ we have $|p(x) - f(x)| \le 1/h$. Fix some such x; let i be
the first index such that $x_i = 1$. As shown above we have $P_i(x) = 1$. Moreover, by inspection of
$T_j(x)$ we have that $T_j(x) = 0$ for all $j \ne i$, and hence $|P_j(x)| \le \frac{1}{h^2}$. Consequently the value of R(x)
must lie in $[b_i - \frac{h-1}{h^2}, b_i + \frac{h-1}{h^2}]$. Since $f(x) = b_i$ we have that p(x) is an $L_\infty$ approximator for f(x)
as desired.

Finally, it is straightforward to verify that p(x) has the claimed bound on degree. □
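
The inner approximator can be evaluated numerically to see how close p is to f on small lists. The following is a sketch under assumptions that are not stated in the text: logarithms are taken base 2, h is a power of two so that the exponent 2 log h is an integer, d is the ceiling of the square root of h, and everything is done in floating point rather than with exact rationals. All function names are hypothetical.

```python
import math
from itertools import product
from typing import Callable, Sequence

def chebyshev(d: int, t: float) -> float:
    """Evaluate C_d(t) via C_0 = 1, C_1 = t, C_{m+1} = 2 t C_m - C_{m-1}."""
    if d == 0:
        return 1.0
    prev, cur = 1.0, t
    for _ in range(d - 1):
        prev, cur = cur, 2.0 * t * cur - prev
    return cur

def modified_list(bits: Sequence[int], x: Sequence[int]) -> int:
    """f(x) for the modified list (x_1, b_1), ..., (x_h, b_h), 0."""
    for xi, b in zip(x, bits):
        if xi == 1:
            return b
    return 0

def inner_approximator(bits: Sequence[int]) -> Callable[[Sequence[int]], float]:
    """Return p(x) = R(x) - R(0^h); assumes h = len(bits) is a power of two."""
    h = len(bits)
    d = math.isqrt(h - 1) + 1            # ceil(sqrt(h))
    power = 2 * int(math.log2(h))        # the exponent 2 log h, log base 2

    def q(y: float) -> float:
        return chebyshev(d, y * (1.0 + 1.0 / h))

    q_at_one = q(1.0)

    def big_p(i: int, x: Sequence[int]) -> float:   # i is 1-based, as in the text
        a_i = h - i + x[i - 1] + sum(1 - x[j] for j in range(i - 1))
        return (q(a_i / h) / q_at_one) ** power

    def big_r(x: Sequence[int]) -> float:
        return sum(b * big_p(i, x) for i, b in enumerate(bits, start=1))

    r_zero = big_r([0] * h)
    return lambda x: big_r(x) - r_zero

def max_error(bits: Sequence[int]) -> float:
    """Largest |p(x) - f(x)| over all inputs in {0,1}^h."""
    p = inner_approximator(bits)
    h = len(bits)
    return max(abs(p(x) - modified_list(bits, x)) for x in product([0, 1], repeat=h))
```

For a list of length 4 the Chebyshev polynomial has degree 2 and each $P_i$ has degree 8, and `max_error` reports how far the approximation is from f in the worst case.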

Strictly speaking we cannot discuss the weight of the polynomial p since its coefficients are
rational numbers but not integers. However, by multiplying p by a suitable integer (clearing
denominators) we obtain an integer polynomial with essentially the same properties. Using the
third fact about Chebyshev polynomials from our proof above, we have that q(1) is a rational
number $N_1/N_2$ where $N_1, N_2$ are each integers of magnitude $h^{O(\sqrt{h})}$. Each $Q_i(x)$ for $i = 1, \ldots, h$
can be written as an integer polynomial (of weight $h^{O(\sqrt{h})}$) divided by $h^{2\sqrt{h}}$. Thus each $P_i(x)$ can
be written as $\tilde{P}_i(x)/(h^{2\sqrt{h}} N_1)^{2 \log h}$ where $\tilde{P}_i(x)$ is an integer polynomial of weight $h^{O(\sqrt{h} \log h)}$. It
follows that p(x) equals $\tilde{p}(x)/C$, where C is an integer which is at most $2^{O(\sqrt{h} \log^2 h)}$ and $\tilde{p}$ is a
polynomial with integer coefficients and weight $2^{O(\sqrt{h} \log^2 h)}$. We thus have

Corollary 9 Let $f \in B_h$ be a modified decision list of length h. Then there is an integer polynomial
p(x) of degree $2\sqrt{h} \log h$ and weight $2^{O(\sqrt{h} \log^2 h)}$ and an integer $C = 2^{O(\sqrt{h} \log^2 h)}$ such that

• for every input $x \in \{0,1\}^h$ we have $|p(x) - C f(x)| \le C/h$.

• $p(0^h) = f(0^h) = 0$.

The fact that $p(0^h)$ is exactly 0 will be important in the next subsection when we combine the
inner approximator with the outer construction.

4.3 Composing the Constructions 

In this section we combine the two constructions from the previous subsections to obtain our main 
polynomial threshold construction: 

Theorem 10 Let L be a decision list of length k. Then for any h < k, L is computed by a
polynomial threshold function of degree $O(\sqrt{h} \log h)$ and weight $2^{O(k/h + \sqrt{h} \log^2 h)}$.

Proof: We suppose without loss of generality that L is the decision list (xi, 61), ... , {xk, bk), fefc+i. 
We begin with the outer construction: from the note following Claim |21 we have that 



$$L(x) = \mathrm{sign}\left(C\left(\sum_{i=1}^{k/h} 3^{k/h-i+1} f_i(x) + b_{k+1}\right)\right)$$

where $C$ is the value from Corollary 9 and each $f_i$ is a modified decision list of length $h$ computing the
restriction of $L$ to its $i$th block as defined in Subsection 4.1. Now we use the inner approximator
to replace each $Cf_i$ above by $p_i$, the approximating polynomial from Corollary 9, i.e. consider
$\mathrm{sign}(H(x))$ where

$$H(x) = \sum_{i=1}^{k/h}\left(3^{k/h-i+1} p_i(x)\right) + Cb_{k+1}.$$

We will show that $\mathrm{sign}(H(x))$ is a polynomial threshold function which computes $L$ correctly and
has the desired degree and weight.

Fix any $x \in \{0,1\}^k$. If $x = 0^k$ then by Corollary 9 each $p_i(x)$ is 0 so $H(x) = Cb_{k+1}$ has the
right sign. Now suppose that $r = (i-1)h + c$ is the first index such that $x_r = 1$. By Corollary 9
we have that

• $3^{k/h-j+1}p_j(x) = 0$ for $j < i$;

• $3^{k/h-i+1}p_i(x)$ differs from $3^{k/h-i+1}Cb_r$ by at most $C3^{k/h-i+1} \cdot \frac{1}{h}$;

• The magnitude of each value $3^{k/h-j+1}p_j(x)$ is at most $C3^{k/h-j+1}(1 + \frac{1}{h})$ for $j > i$.

Combining these bounds, the value of $H(x)$ differs from $3^{k/h-i+1}Cb_r$ by at most

$$C\left(\frac{3^{k/h-i+1}}{h} + \left(1 + \frac{1}{h}\right)\left(3^{k/h-i} + 3^{k/h-i-1} + \cdots + 3\right) + 1\right)$$

which is easily seen to be less than $C3^{k/h-i+1}$ in magnitude. Thus the sign of $H(x)$ equals $b_r$,
and consequently $\mathrm{sign}(H(x))$ is a valid polynomial threshold representation for $L(x)$. Finally, our
degree and weight bounds from Corollary 9 imply that the degree of $H(x)$ is $O(h^{1/2}\log h)$ and the
weight of $H(x)$ is $2^{O(k/h) + O(h^{1/2}\log^2 h)}$, and the theorem is proved. □
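The exact (pre-approximation) outer representation used at the start of this proof can be checked by brute force on small instances. The following is a minimal sketch, assuming that $h$ divides $k$ and that each modified decision list $f_i$ outputs $b_r$ for the first index $r$ in block $i$ with $x_r = 1$ and 0 if the block is all zeros; the function names are ours, and the positive constant $C$ is dropped since it does not affect the sign.

```python
from itertools import product

def decision_list(x, bits, default):
    """Output b_r for the first index r with x_r = 1, else the default b_{k+1}."""
    for xr, br in zip(x, bits):
        if xr == 1:
            return br
    return default

def modified_block(x, bits, i, h):
    """f_i: the list restricted to block i (1-indexed); outputs 0 on an all-zero block."""
    lo = (i - 1) * h
    for xr, br in zip(x[lo:lo + h], bits[lo:lo + h]):
        if xr == 1:
            return br
    return 0

def outer_sum(x, bits, default, h):
    """sum_i 3^(k/h - i + 1) * f_i(x) + b_{k+1}, assuming h divides k."""
    blocks = len(bits) // h
    total = default
    for i in range(1, blocks + 1):
        total += 3 ** (blocks - i + 1) * modified_block(x, bits, i, h)
    return total

def check_outer(k, h):
    """Exhaustively compare sign(outer_sum) with the decision list itself."""
    for bits in product((-1, 1), repeat=k):
        for default in (-1, 1):
            for x in product((0, 1), repeat=k):
                s = outer_sum(x, bits, default, h)
                if (1 if s > 0 else -1) != decision_list(x, bits, default):
                    return False
    return True
```

The check only exercises the exact $f_i$; replacing each $Cf_i$ by an approximator $p_i$ with error at most $C/h$ is what the error analysis above accounts for.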

Taking $h = k^{2/3}/\log^{4/3}k$ in the above theorem we obtain our main result on representing
decision lists as polynomial threshold functions:

Theorem 3 Let $L$ be a decision list of length $k$. Then $L$ is computed by a polynomial threshold
function of degree $k^{1/3}\log^{1/3}k$ and weight $2^{O(k^{1/3}\log^{4/3}k)}$.

Theorem |21 immediately implies that Expanded- Winnow can learn decision lists of length k 
using $2^{\tilde{O}(k^{1/3})}\log n$ examples and time $n^{\tilde{O}(k^{1/3})}$.
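For the reader's convenience, the substitution behind this choice of $h$ can be spelled out as follows. This is a sketch of the arithmetic only, using $\log h = \Theta(\log k)$ for the stated $h$ and the degree and weight bounds of Theorem 10.

```latex
\begin{align*}
h &= \frac{k^{2/3}}{\log^{4/3} k}
  \quad\Longrightarrow\quad
  h^{1/2} = \frac{k^{1/3}}{\log^{2/3} k}, \qquad \log h = \Theta(\log k), \\
h^{1/2}\log h &= \Theta\!\left(k^{1/3}\log^{1/3} k\right)
  && \text{(degree)}, \\
\frac{k}{h} &= k^{1/3}\log^{4/3} k, \qquad
h^{1/2}\log^{2} h = \Theta\!\left(k^{1/3}\log^{4/3} k\right)
  && \text{(both terms in the weight exponent)}.
\end{align*}
```

The two terms in the weight exponent are balanced by this choice, which is why it is the natural setting of $h$.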

5 Application to Learning Decision Trees 

In 1989 Ehrenfeucht and Haussler [5] gave a time $n^{O(\log s)}$ algorithm for learning decision trees
of size $s$ over $n$ variables. Their algorithm uses $n^{O(\log s)}$ examples, and they asked if the sample
complexity could be reduced to poly$(n, s)$. We can apply our techniques here to give an algorithm
using $2^{\tilde{O}(s^{1/3})}\log n$ examples, if we are willing to spend $n^{\tilde{O}(s^{1/3})}$ time.

First we need to generalize Theorem 10 for higher order decision lists. An $r$-decision list is like
a standard decision list but each pair is now of the form $(C_i, b_i)$ where $C_i$ is a conjunction of at
most $r$ literals and as before $b_i = \pm 1$. The output of such an $r$-decision list on input $x$ is $b_i$ where
$i$ is the smallest index such that $C_i(x) = 1$.
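As a concrete illustration of this definition, an $r$-decision list can be evaluated as in the sketch below. The encoding is our own assumption: each conjunction is a list of (index, required value) pairs, and a default output plays the role of $b_{k+1}$ when no conjunction is satisfied.

```python
def conjunction(x, literals):
    """True iff every literal holds; a literal (i, v) requires x[i] == v."""
    return all(x[i] == v for i, v in literals)

def eval_r_decision_list(x, rules, default):
    """rules: list of (literals, b) pairs with b in {-1, +1}.
    Returns b_i for the smallest i whose conjunction C_i is satisfied by x,
    and the default output if no conjunction is satisfied."""
    for literals, b in rules:
        if conjunction(x, literals):
            return b
    return default

# A 2-decision list over three variables: (x0 AND NOT x1, +1), (x2, -1), default +1.
example_rules = [([(0, 1), (1, 0)], 1), ([(2, 1)], -1)]
```

A standard decision list is the special case in which every conjunction contains a single literal.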

We have the following: 

Corollary 11 Let $L$ be an $r$-decision list of length $k$. Then for any $h < k$, $L$ is computed by a
polynomial threshold function of degree $O(rh^{1/2}\log h)$ and weight $2^{r + O(k/h + h^{1/2}\log^2 h)}$.

Proof: Let $L$ be the $r$-decision list $(C_1, b_1), \ldots, (C_k, b_k), b_{k+1}$. By Theorem 10 there is a
polynomial threshold function of degree $O(h^{1/2}\log h)$ and weight $2^{O(k/h + h^{1/2}\log^2 h)}$ over the variables
$C_1, \ldots, C_k$. Now replace each variable $C_i$ by the interpolating polynomial which computes it exactly
as a function from $\{0,1\}^n$ to $\{0,1\}$. Each such interpolating polynomial has degree $r$ and integer
coefficients of total magnitude at most $2^r$, and the corollary follows. □
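The interpolating polynomial for a conjunction is simply the product of $x_i$ for each positive literal and $(1 - x_i)$ for each negated literal. The sketch below expands such a product into monomials so that the degree and the total coefficient magnitude can be inspected directly; the dictionary representation and function names are our own choices for illustration.

```python
from itertools import product

def conjunction_polynomial(literals):
    """Expand the product of x_i (positive literal) and (1 - x_i) (negated literal).
    literals: list of (index, is_positive) pairs.
    Returns {frozenset of variable indices: integer coefficient}."""
    poly = {frozenset(): 1}
    for i, positive in literals:
        nxt = {}
        for mono, c in poly.items():
            with_i = mono | {i}
            if positive:
                nxt[with_i] = nxt.get(with_i, 0) + c
            else:
                # multiply by (1 - x_i): keep the monomial, subtract the extended one
                nxt[mono] = nxt.get(mono, 0) + c
                nxt[with_i] = nxt.get(with_i, 0) - c
        poly = nxt
    return poly

def evaluate(poly, x):
    """Evaluate a multilinear polynomial at a 0/1 point x."""
    return sum(c * all(x[i] for i in mono) for mono, c in poly.items())

def total_magnitude(poly):
    return sum(abs(c) for c in poly.values())
```

For a conjunction with $r$ literals of which $t$ are negated, the expansion has $2^t \le 2^r$ monomials with coefficients $\pm 1$, which matches the bound quoted in the proof.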

Corollary 12 There is an algorithm for learning $r$-decision lists over $\{0,1\}^n$ which, when learning
an $r$-decision list of length $k$, has mistake bound $2^{\tilde{O}(r + k^{1/3})}\log n$ and runs in time $n^{\tilde{O}(rk^{1/3})}$.

Now we can apply Corollary 12 to obtain a tradeoff between running time and sample complexity
for learning decision trees:



Theorem 13 Let $D$ be a decision tree of size $s$ over $n$ variables. Then $D$ can be learned using
$2^{\tilde{O}(s^{1/3})}\log n$ examples in time $n^{\tilde{O}(s^{1/3})}$.

Proof: Blum j5] has shown that any decision tree of size s is computed by a (log s)-decision list of 
length $s$. Applying Corollary 12 we thus see that Expanded-Winnow can be used to learn decision
trees of size $s$ over $\{0,1\}^n$ with the claimed bounds on time and sample complexity. □

6 Lower Bounds for Decision Lists 

Here we observe that our construction from Theorem 10 is essentially optimal in terms of the
tradeoff it achieves between polynomial threshold function degree and weight.

In [2] Beigel constructs an oracle separating PP from $P^{NP}$. At the heart of his construction is
a proof that any low degree polynomial threshold function for a particular decision list, called
the $\mathrm{ODDMAXBIT}_n$ function, must have large weights:

Definition 14 The $\mathrm{ODDMAXBIT}_n$ function on input $x = x_1, \ldots, x_n \in \{0,1\}^n$ equals $(-1)^i$ where
$i$ is the index of the first nonzero bit in $x$.

It is clear that the $\mathrm{ODDMAXBIT}_n$ function is equivalent to a decision list of length $n$:

$(x_1, -1), (x_2, 1), (x_3, -1), \ldots, (x_n, (-1)^n), (-1)^{n+1}$
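The equivalence can be confirmed mechanically for small $n$. In the sketch below (names are ours), the decision list is built as explicit data and compared against Definition 14 on every nonzero input; on the all-zeros input, which the definition does not cover, the list outputs its default value $(-1)^{n+1}$.

```python
from itertools import product

def eval_decision_list(x, pairs, default):
    """pairs: list of (0-based variable index, output); first satisfied pair wins."""
    for index, b in pairs:
        if x[index] == 1:
            return b
    return default

def oddmaxbit_list(n):
    """The decision list (x_1, -1), (x_2, 1), ..., (x_n, (-1)^n), (-1)^(n+1)."""
    pairs = [(i - 1, (-1) ** i) for i in range(1, n + 1)]
    return pairs, (-1) ** (n + 1)

def oddmaxbit(x):
    """(-1)^i for the 1-based index i of the first nonzero bit; x must be nonzero."""
    first = x.index(1) + 1
    return (-1) ** first
```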

The main technical theorem which Beigel proves in [2] states that any polynomial threshold function
of degree $d$ computing $\mathrm{ODDMAXBIT}_n$ must have weight $2^{\Omega(n/d^2)}$:

Theorem 15 Let $p$ be a degree $d$ polynomial threshold function with integer coefficients computing
$\mathrm{ODDMAXBIT}_n$. Then $w = 2^{\Omega(n/d^2)}$ where $w$ is the weight of $p$.¹

(As stated in [2] the bound is actually $w \ge \frac{1}{s}2^{\Omega(n/d^2)}$ where $s$ is the number of nonzero coefficients
in $p$. Since $s \le w$ this implies the result as stated above.)

A lower bound of $2^{\Omega(n)}$ on the weight of any linear threshold function ($d = 1$) for $\mathrm{ODDMAXBIT}_n$
has long been known [28]; Beigel's proof generalizes this lower bound to all $d = O(n^{1/3})$. A matching
upper bound of $2^{O(n)}$ on weight for $d = 1$ has also long been known [28]. Our Theorem 10 gives an
upper bound which matches Beigel's lower bound (up to logarithmic factors) for all $d = O(n^{1/3})$:

Observation 16 For any $d = O(n^{1/3})$ there is a polynomial threshold function of degree $d$ and
weight $2^{\tilde{O}(n/d^2)}$ which computes $\mathrm{ODDMAXBIT}_n$.

Proof: Set $d = h^{1/2}\log h$ in Theorem 10. The weight bound given by Theorem 10 is
$2^{O\left(\frac{n\log^2 d}{d^2} + d\log d\right)}$, which is $2^{\tilde{O}(n/d^2)}$ for $d = O(n^{1/3})$. □
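The calculation in this proof can be expanded as follows. This is a sketch of the arithmetic only, using $\log h = \Theta(\log d)$ and Theorem 10 with $k = n$.

```latex
\begin{align*}
d = h^{1/2}\log h
  &\;\Longrightarrow\; h = \frac{d^{2}}{\log^{2} h} = \Theta\!\left(\frac{d^{2}}{\log^{2} d}\right), \\
\frac{n}{h} &= \Theta\!\left(\frac{n\log^{2} d}{d^{2}}\right), \qquad
h^{1/2}\log^{2} h = d\log h = \Theta(d\log d), \\
d = O(n^{1/3})
  &\;\Longrightarrow\; d = O\!\left(\frac{n}{d^{2}}\right)
  \;\Longrightarrow\; \frac{n\log^{2} d}{d^{2}} + d\log d = \tilde{O}\!\left(\frac{n}{d^{2}}\right).
\end{align*}
```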

Note that since the $\mathrm{ODDMAXBIT}_n$ function has a polynomial size DNF (see Appendix A),
Beigel's lower bound gives a polynomial size DNF $f$ such that any degree $O(n^{1/3})$ polynomial
threshold function for $f$ must have weight $2^{\Omega(n^{1/3})}$. This suggests that the Expanded-Winnow
algorithm cannot learn polynomial size DNF in $2^{\tilde{O}(n^{1/3})}$ time from $2^{n^{1/3-\epsilon}}$ examples for any $\epsilon > 0$,
and thus suggests that improving the sample complexity of the DNF learning algorithm from [20]
while maintaining its $2^{\tilde{O}(n^{1/3})}$ running time may be difficult.

¹Beigel actually proves something stronger, namely that there must exist a coefficient whose absolute value is at
least $2^{\Omega(n/d^2)}$.






7 Learning Parity Functions 



We first briefly review the standard algorithm for learning parity functions. 

The standard algorithm for learning parity functions works by viewing a set of $m$ labelled
examples as a set of $m$ linear equations over GF(2). Each labelled example $(x, b)$ induces the
equation $\sum_{i : x_i = 1} a_i = b \bmod 2$, where the unknown $a_i$ indicates whether variable $i$ belongs
to the parity. Since the examples are labelled according to some parity function,
this parity function will be a consistent solution to the system of equations. Using Gaussian
elimination it is possible to efficiently find a solution to the linear system, which yields a parity
function consistent with all $m$ examples. The following standard fact from learning theory (often
referred to as "Occam's Razor") shows that finding a consistent hypothesis suffices to establish
PAC learnability:

Fact 17 Let $C$ be a concept class and $H$ a finite set of hypotheses. Set $m = \frac{1}{\epsilon}(\log|H| + \log\frac{1}{\delta})$
where $\epsilon$ and $\delta$ are the usual accuracy and confidence parameters for PAC learning. Suppose that
there is an algorithm $A$ running in time $t$ which takes as input $m$ examples which are labelled
according to some element of $C$ and outputs a hypothesis $h \in H$ consistent with these examples.
Then $A$ is a PAC learning algorithm for $C$ with running time $t$ and sample complexity $m$.

Consider using the above algorithm to learn an unknown parity of length at most $k$. Even though
there is a solution of weight at most $k$, Gaussian elimination (applied to a system of $m$ equations
in $n$ variables over GF(2)) may yield a solution of weight as large as $\min(m, n)$. Using Fact 17 we
thus obtain a sample complexity bound of $O(n)$ examples for learning a parity of length at most $k$.
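The standard algorithm reviewed above can be sketched in a few lines. The implementation below is illustrative only: it encodes each equation as an integer bitmask (bit $i$ set when $x_i = 1$), labels are taken in $\{0, 1\}$, free variables are set to 0, and all names are our own.

```python
import random

def solve_gf2(equations):
    """Solve a linear system over GF(2).
    equations: list of (mask, rhs); bit i of mask is the coefficient of a_i.
    Returns a solution as a bitmask (free variables set to 0), or None if inconsistent."""
    basis = {}  # pivot bit position -> (mask, rhs), pivot is the highest set bit
    for mask, rhs in equations:
        while mask:
            p = mask.bit_length() - 1
            if p not in basis:
                basis[p] = (mask, rhs)
                break
            bmask, brhs = basis[p]
            mask ^= bmask
            rhs ^= brhs
        else:
            if rhs:  # reduced to 0 = 1
                return None
    solution = 0
    for p in sorted(basis):  # back-substitute from the lowest pivot upward
        mask, rhs = basis[p]
        lower = mask & ~(1 << p)
        if rhs ^ (bin(lower & solution).count("1") & 1):
            solution |= 1 << p
    return solution

def to_mask(x):
    return sum(bit << i for i, bit in enumerate(x))

def parity(subset_mask, x_mask):
    """Parity (0 or 1) of the bits of x selected by subset_mask."""
    return bin(subset_mask & x_mask).count("1") & 1

def consistent_parity(examples):
    """examples: list of (x, b) with x a 0/1 tuple and b in {0, 1}."""
    return solve_gf2([(to_mask(x), b) for x, b in examples])
```

Note that the returned parity is merely consistent with the sample; nothing in the elimination favours a solution of small weight, which is the source of the $O(n)$ sample bound.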

We now present a simple polynomial-time algorithm for learning an unknown parity function
on $k$ variables using $O(n^{1-1/k})$ examples. To the best of our knowledge this is the first improvement
on the standard algorithm and analysis given above.

Theorem 18 The class of all parity functions on at most $k$ variables is learnable in polynomial
time using $O(n^{1-1/k}\log n)$ examples. The hypothesis output by the learning algorithm is a parity
function on $O(n^{1-1/k}\log n)$ variables.

Proof: If $k = \Omega(\log n)$ then the standard algorithm suffices to prove the claimed bound. We thus
assume that $k = o(\log n)$.

Let $H$ be the set of all parity functions of size at most $n^{1-1/k}$. Note that $|H| \le n^{n^{1-1/k}}$ so
$\log|H| \le n^{1-1/k}\log n$. Consider the following algorithm:

1. Choose $m = \frac{1}{\epsilon}(\log|H| + \log(1/\delta))$ examples. Express each example as a linear equation
over $n$ variables mod 2 as described above.

2. Randomly choose a set of $n - n^{1-1/k}$ variables and assign them the value 0.

3. Use Gaussian elimination to attempt to solve the resulting system of equations on the
remaining $n^{1-1/k}$ variables. If the system has a solution, output the corresponding parity (of
size at most $n^{1-1/k}$) as the hypothesis. If the system has no solution, output "FAIL."

If the simplified system of equations has a solution, then by Fact 17 this solution is a good
hypothesis. We will show that the simplified system has a solution with probability $\Omega(1/n)$. The
theorem follows by repeating steps 2 and 3 of the above algorithm until a solution is found (an
expected $O(n)$ repetitions will suffice).
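Steps 2 and 3, together with the repetition described above, can be sketched as follows. This is a minimal illustration under our own assumptions: examples are given as (bitmask, label) pairs with labels in $\{0, 1\}$, $n^{1-1/k}$ is rounded up to an integer, and a round limit `max_rounds` is added so that the loop always terminates.

```python
import math
import random

def solve_gf2(equations):
    """Gaussian elimination over GF(2) on (mask, rhs) rows; free variables set to 0."""
    basis = {}
    for mask, rhs in equations:
        while mask:
            p = mask.bit_length() - 1
            if p not in basis:
                basis[p] = (mask, rhs)
                break
            bmask, brhs = basis[p]
            mask ^= bmask
            rhs ^= brhs
        else:
            if rhs:
                return None
    solution = 0
    for p in sorted(basis):
        mask, rhs = basis[p]
        if rhs ^ (bin((mask & ~(1 << p)) & solution).count("1") & 1):
            solution |= 1 << p
    return solution

def parity(subset_mask, x_mask):
    return bin(subset_mask & x_mask).count("1") & 1

def learn_sparse_parity(examples, n, k, rng, max_rounds):
    """Repeat steps 2 and 3: zero out all but ell = ceil(n^(1-1/k)) random variables
    and try to solve the restricted system. Returns a parity (as a bitmask) on at
    most ell variables that is consistent with the examples, or None on failure."""
    ell = max(k, math.ceil(n ** (1 - 1 / k)))
    for _ in range(max_rounds):
        kept = rng.sample(range(n), ell)  # the other n - ell variables are assigned 0
        keep_mask = sum(1 << i for i in kept)
        restricted = [(x_mask & keep_mask, b) for x_mask, b in examples]
        hypothesis = solve_gf2(restricted)
        if hypothesis is not None:
            return hypothesis
    return None
```

Because every pivot of the restricted system lies among the kept variables, the returned hypothesis is automatically a parity on at most $n^{1-1/k}$ variables (up to rounding).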

Let $V$ be the set of $k$ relevant variables on which the unknown parity function depends. It
is easy to see that as long as no variable in $V$ is assigned a 0, the resulting simplified system of
equations will have a solution. Let $\ell = n^{1-1/k}$. The probability that in Step 2 the $n - \ell$ variables
chosen do not include any variables in $V$ is exactly $\binom{n-k}{n-\ell}/\binom{n}{n-\ell}$ which equals
$\binom{n-k}{\ell-k}/\binom{n}{\ell}$. Expanding binomial coefficients we have

$$\frac{\binom{n-k}{\ell-k}}{\binom{n}{\ell}} = \prod_{i=1}^{k}\frac{\ell-k+i}{n-k+i} > \left(\frac{\ell-k}{n}\right)^{k} = \frac{1}{n}\left(1-\frac{k}{\ell}\right)^{k}. \qquad (2)$$

The bound $k = o(\log n)$ implies that $(1 - k/\ell)^k > 1/2$. Consequently (2) is at least $\frac{1}{2n}$
and the theorem is proved. □
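The success probability of a single round can be computed exactly for concrete parameters, which gives a quick sanity check on the $\Omega(1/n)$ claim. The sketch below uses exact rational arithmetic; the parameter values in the usage note are our own examples, chosen so that $n^{1-1/k}$ is an integer.

```python
from fractions import Fraction
from math import comb

def success_probability(n, k, ell):
    """Probability that none of the k relevant variables is among the n - ell zeroed ones."""
    return Fraction(comb(n - k, ell - k), comb(n, ell))

def product_form(n, k, ell):
    """The same quantity written as the product of (ell - k + i)/(n - k + i), i = 1..k."""
    result = Fraction(1)
    for i in range(1, k + 1):
        result *= Fraction(ell - k + i, n - k + i)
    return result
```

For instance, with $n = 10000$, $k = 2$ and $\ell = 100$ the probability is $\frac{100 \cdot 99}{10000 \cdot 9999}$, slightly below $1/n$ and comfortably above $1/(2n)$.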



8 Future Work 

An obvious goal for future work is to improve our algorithmic results for learning decision lists. The 
question still remains: can decision lists of length $k$ be learned in poly$(n)$ time from poly$(k, \log n)$
examples? As a first step, one might attempt to extend the tradeoffs we achieve: is it possible to 
learn decision lists of length k in n^^^^ time from poly(A;, log n) examples? 

Another goal is to extend our results for decision lists to broader concept classes. In particular, 
since decision lists are a special case of linear threshold functions, it would be interesting to obtain 
analogues of our algorithmic results for learning general linear threshold functions (independent 
of their weight). We note here that Goldmann et al. JS] have given a linear threshold function 
over {—1, 1}" for which any polynomial threshold function must have weight 2^'^"^ regardless of 
its degree. Moreover Krause and Pudlak [2j have shown that any Boolean function which has a 
polynomial threshold function over $\{0,1\}^n$ of weight $w$ has a polynomial threshold function over
{—1,1}" of weight n?"w'^. These results imply that representational results akin to Theorem |31 for 
general linear threshold functions must be quantitatively weaker than Theorem IHl in particular, 
there is a linear threshold function over $\{0,1\}^n$ with $k$ nonzero coefficients for which any polynomial
threshold function, regardless of degree, must have weig 

For parity functions, one challenge is to learn parity functions on $k = \Theta(\log n)$ variables in
polynomial time using a sublinear number of examples. Another challenge is to improve the sample
complexity of learning size $k$ parities from our current bound of $O(n^{1-1/k})$.
A Alternate Proof of Corollary 7

The alternate proof of Corollary 7 is based on the observation that any decision list
$L = (\ell_1, b_1), \ldots, (\ell_k, b_k), b_{k+1}$ of length $k$ has a $k$-term DNF in which each term is a conjunction of at most $k$ literals.
To see this, note that we obtain a DNF for $L$ simply by taking the OR of all terms $\bar{\ell}_1\bar{\ell}_2\cdots\bar{\ell}_{i-1}\ell_i$
for each $i$ such that $b_i = 1$. Now we use the following result from [20]:

Theorem 19 (Corollary 12 of [20]) Let $f$ be a DNF formula of $s$ terms, each of length at most
$t$. Then there is a polynomial threshold function for $f$ of degree $O(\sqrt{t}\log s)$ and weight $t^{O(\sqrt{t}\log s)}$.

Applying this result to the DNF representation for $L$, we immediately obtain that there is a
polynomial threshold function for $L$ which has degree $O(k^{1/2}\log k)$ and weight $2^{O(k^{1/2}\log^2 k)}$. (In Section
4.2, though, we need the construction given in our original proof of Corollary 7.)
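The DNF described at the start of this appendix can be built and checked explicitly. In the sketch below we take each literal $\ell_i$ to be the variable $x_i$ for simplicity, and, as an assumption of ours not spelled out in the text, we add one extra all-negated term when the default output $b_{k+1}$ equals 1 so that the all-zeros input is handled as well.

```python
from itertools import product

def decision_list(x, bits, default):
    for xi, b in zip(x, bits):
        if xi == 1:
            return b
    return default

def decision_list_to_dnf(bits, default):
    """One term per index i with b_i = 1: x_1, ..., x_{i-1} negated and x_i positive.
    Each term is (indices that must be 0, index that must be 1 or None)."""
    terms = [(tuple(range(i)), i) for i, b in enumerate(bits) if b == 1]
    if default == 1:
        # assumption: the default contributes the term with every variable negated
        terms.append((tuple(range(len(bits))), None))
    return terms

def eval_dnf(terms, x):
    return any(
        all(x[j] == 0 for j in negated) and (positive is None or x[positive] == 1)
        for negated, positive in terms
    )
```

The resulting DNF has at most $k + 1$ terms, each of length at most $k$, so the bounds obtained from Theorem 19 are unaffected.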